Home Credit is an international consumer finance provider that lends money primarily to those with little or no credit history. They created a kaggle competition where users would use machine learning and statistical methods to predict the loan default risk of individuals based on Home Credit's loan applicant data. Although Home Credit had already used machine learning to project default risk, they hoped to find ways to improve their predictive ability based on what the Kagglers produced.
Home Credit serves those without credit history or are unbanked, who are likely to be viewed as high risk for a loan default, even if they are financially responsible and will make the necessary repayments. Home Credit needs to avoid making loans to those who will end up unable to complete payments, as this can result in a larger financial cost than the gain of several successful loans. They also want to make their services as accessible as possible through providing the loans to all those who are truly eligible. Machine learning can help us manage these risks and rewards. By building an accurate model, we will be providing Home Credit the ability to provide more loans, since loan applicants identified by our model as at lower risk for a loan default will be less likely to default on the loan.
The main research question is whether we can create a model that can signal whether a loan applicant will default and is interpretable. By interpretable, I mean that one can understand the relationships the model has captured, as well as give justification as to why a prediction was made. This flows into another goal; finding the features that best predict a default.
The data was retrieved from the Home Credit Defaul Risk Prediction competition on kaggle.com (https://www.kaggle.com/c/home-credit-default-risk/data). The data and supporting information was contained in 10 files, and is 2.68 GB in size.
The training dataset is contained in application_train.csv, and contains a total of 307511 rows.
Below is view of the first 5 rows:
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
There are a 122 columns in total, so the above dataset view only contains a fraction of these.
56 of these are categorical, while the rest are numerical.
I will also be using bureau.csv and bureau_balance.csv, files containing data on existing and past loan balances at other institutions for the loan applicant. The data I am interested in is in bureau_balance.csv, so while there are several columns in bureau.csv, I am only interesting in using the two id columns it contains. These will allow me to match the data in application_train.csv to bureau_balance.csv. Here is the view of the first 5 rows of bureau_balance:
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
This is a large dataset, as it contains a record for every month and every previously existing/closed loans for the applicants.
Our target variable is whether the loan applicant defaulted. It is a boolean value, with a value of 0 representing no default, and 1 signaling a loan applicant that ended up defaulting. This is an unbalanced data point, as there are almost 10 times more succesfull loans than defaults.
Several variables contain information about the individual applying, including demographics like gender or education status, or miscellaneous information like whether the person owns a home or car. These variables are mainly boolean or categorical.
This group of variables describe the loan itself, such as the amount of money being borrowed, the time the approval process began, and what documentation was provided. These variables are of all types, including numerical.
These variables contain the financial information of the applicant. They include normalized scores (I believe these scores act like a credit score, but Home Credit does not state this in the column description they provided) and the amount of enquiries to Credit Bureau about the client in the time prior to the application. These variables are mainly numerical.
Home Credit has provided columns containing the applicant's annual income and field of work. These variables are of all types.
A fair amount of variables contain data about the property the client lives in. These are numeric, and have been normalized.
This group of variables describe the region where the client lives, including the normalized population and Home Credit's rating of the region.
Most columns are missing less than 1% of their data, but there are several exceptions. The property info columns are missing the majority of their values. Other variables with a large share of missing values are the applicant's car age, occupation type, two of their scores (normalized financial scores), and the Credit Bureau enquiry counts. I dropped the property columns as well as the applicant's car age since it is missing over 2/3 of its data.
For categorical columns with missing values, I chose 1 category as the default category (a category that shouldn't have effect on default probability), or created a new "unknown" category. The remaining null values were imputed using KNN Imputation with 5 neighbors. In order to do the imputation, I selected a subset of columns to feed the imputer along with the columns requiring imputation after scaling them. I chose a subset in order to reduce computation time, as KNN Imputation is an expensive process. With these changes, the resulting dataset has no missing values. Future researchers may need to consider including the columns that were dropped, as I have not demonstrated they had no relationship with default rate.
Home Credit has also provided a file with past loan balances at other institutions for the loan applicant, titled bureau_balance.csv. I aggregated this file by loan to get the count of months where the loan was late on payment. I then matched up those loans to the applicants using the data in bureau.csv. Applicants can have multiple loans, so for each applicant, I took the total number of defaults accross all their loans. Finally, I added the total default count as a new variable in our training dataset by doing a left join on the applicant id, where the left dataset is our original training dataset.
3 new variables were created from the data provided by Home Credit. The first is the social circle default ratio columns. These 2 columns our based on 4 other columns in the original dataset that contain the amount of observations along with the amount of defaults in the client's social circle. By creating a ratio, we should be able to know the true propensity of defaults in the client's social circle as a ratio, instead of only going off the raw default count. These values are generated for a 30 day default and 60 day default.
The other new column was the amount of requests to the Credit Bureau over the year prior to the application. In the original table, the total number of requests are split between the last hour, day, week, month, quarter, and year, excluding the observations added to another column. For example, the year request count excludes the count from the last quarter. I will be adding these columns together for one final annual total, as I don't believe we have enough observations for the smaller periods for them to be useful. Finally, the column from the bureau aggregation will also be added to the training dataset.
After our column drops and additions, we have 78 columns in our cleaned data, 52 of which are categorical and 26 numerical. Below is a views of the first 5 rows of this data.
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | DEF_30_RATIO_SOCIAL_CIRCLE | DEF_60_RATIO_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU | DEFAULT_COUNT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 27.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Chen et. al. "Using Random Forest to Learn Imbalanced Data". https://unomaha.instructure.com/courses/51016/pages/teaching-presentation
Leo Breiman. "Random Forests". https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf
A random forest classifier consists of multiple trees, all providing predictions given an input. Each of these trees is trained using a random subset of variables and observations from the original input. When dealing with imbalanced data, random forests will often ignore the minority class, as the standard error/objective functions provide little penalty for misclassified minority class observations due to their low amount. One way to prevent this is applying a heavier penalty for misclassifying a minority class, which can be done by applying a large weight to minority classes. A normal Gini function (the function that calculates the impurity of a tree node) will look like the below:
$$1 - [(Ratio_{0})^2 + (Ratio_{1})^2]$$A weighted version looks like this:
$$1 - [w_{0}(Ratio_{0})^2 + w_{1}(Ratio_{1})^2]$$A balanced random forest preserves the same basic algorithm as the above, except it does not modify the Gini function. Instead, it randomly under samples the more frequent label on each bootstrapped sample, "balancing" the problem.
I used the implementation of the Random Forest Classifier in Python's scikit-learn package to build the weighted model. scikit-learn comes with many different machine learning models, as well as functions to preprocess data and evaluate model results. Due to scikit-learn's well know API, it was easy to create the model and generate predictions. The balanced model comes from the imblearn package. This package is focused on dealing with unbalanced data, and its models follow a scikit-learn's API as well.
Both these methods were covered in a class presentation and were demonstrated to have been able to handle unbalanced data. As loan defaults are much less common than succesful loans, these models are potential solutions to detect loans that will default. Additionally, random forest models are widely known to be affective of capturing relationships between independent variables and a target variable.
InterpretML. "Explainable Boosting Machine". https://interpret.ml/docs/ebm.html
Harsha Nori et. al. "InterpretML: A Unified Framework for Machine Learning". https://arxiv.org/pdf/1909.09223.pdf
Yin Lou et. al. "Intelligible Models for Classification and Regression". https://www.cs.cornell.edu/~yinlou/papers/lou-kdd12.pdf
Explainable Boosting Machine (EBM) is a generalized additive model (GAM), having the form:
$$g(E[y]) = \beta _{0} + \Sigma f_{j}(x_{j})$$There are two major differences between EBM and a standard GAM. First, each feature function $f_{j}$ is tuned using techniques like bagging and gradient boosting. Second, automatic pairwise interaction detection is supported, so the EBM can be represented as:
$$g(E[y]) = \beta _{0} + \Sigma f_{i}(x_{i}) + \Sigma f_{i,j}(x_{i},x_{j})$$Since EBM is a GAM, it is very interpretable. Each $f_{j}$ can be visualized across $x_{j}$, allowing us to understand the relationship between $x_{j}$ and $y$ the model has captured. We can also know how each feature contributes to the final prediction, as it is merely the sum of the outputs of all $f_{j}$.
With weights, we can force the EBM to pay closer attention to the minority class. Without them, it will just predict the majority label.
EBM can be found in InterpretML, a package containing several machine learning interpretability methods. InterpretML was developed at Microsoft Research and made open source. EBM is an implemetation of the algorithm proposed by Lou et. al. ("Intelligible Models for Classification and Regression"), and follows scikit-learn's API. As part of its "interpretable" nature, visuals revealing the model's structure or explaining its predictions are easily created.
Giving a loan applicant a higher probability for default could have serious consequences for whether they end up getting the loan they need. Before we accept the model's assesment, we should understand what is behind it. A model may have identified a statistically significant relationship between a variable and default risk, but this relationship may be the result of bad data or some other error. There is a lot about the world of loan applications that a model cannot capture. By having the model explain or justify its estimate, we allow humans to make an educated decision off of the model's prediction, rather than trusting its assesment without verification. With other complex models that don't have this visibility, it is much more difficult to assess why the model made the prediction it did.
Another reason is related to understanding the relationships between our independent variables and default risk. While complex models don't clearly reveal the relationships that have been captured, EBM does this with little effort. This can be valuable at a global level, as Home Credit could understand what attributes of applicants are related to higher loan risk. There is also value at a local level. For an individual applicant, we can understand what characteristics are driving their default risk. Rather than just using the model to accept or deny an application, the model's output could be used to reveal when and why an applicant may be in need of additional resources to prevent a default.
We have 78 columns in our final dataset. I have selected a subset of these that showcase the kinds of relationships the independent variables have with our target variable.
The score variables (column names EXT_SOURCE_1, 2, and 3) have the most clear relationship with the default count. The below graph displays the default rate for different buckets of the second score. As the score increases, defaults become less common.
The default rate has an increasing relationship with the number of previous defaults. Unlike the score relationship, the relationship is not large. The ratio of defaults is only a couple percentage points higher for applicants with 4+ defaults compared to those with none.
Unlike the scores and number of past loan defaults, the relationship between the loan credit amount and default rate is not monotonic. The highest ratio of default counts are found in loans between 400,000 and 600,000. One explanation is that low value loans might have lower payments, so defaults are less common, while high value loans will be harder to pay off, but Home Credit may increase its strictness on who gets the loan, leading to us observing less defaults. The loans between these extremes occur at a sweetspot on difficuly to pay back and review strictness, leading us to see more defaults.
Females appear to be less likely to default than males.
Like with gender, default rate seems to differ for different occupations. The default ratio for drivers is much higher than accountants and high skilled tech workers.
The relationships visualized above are a small sample of all the features in our dataset, and the diversity they exhibit is reflective of the entire dataset. Some of the relationships are monotonic, while others are more complex. Many columns are binary, while others are multiclass. These characteristics justify my decision to use advanced machine learning methods like random forest and EBM.
I used two different approaches to train a random forest model that could handle imbalanced data. First, I used weights, penalizing minority class missclassifications at a higher amount. This "weighted" random forest model failed to pay sufficient attention to the positive label. Although I modified the parameters for the random forest model in many different ways, I was unable to raise the precision for the positive label while producing reasonable results. Below are some metrics, calculated on our hold out dataset:
Accuracy score: 0.9187 Precision: 0.0004099200655872105 Recall: 0.6666666666666666
Though the accuracy score is high, you can see from the precision that we are identifying almost none of the loans that ended up with a default. The confusion matrix below reveals that the classifier only identified a couple of the defaulted loans.
array([[55120, 4877],
[ 1, 2]], dtype=int64)
The second approach was to use a balanced random forest, where undersampling of the majority class forces the model to pay closer attention to the minority class. This approach enjoyed greater success:
Accuracy score: 0.6858333333333333 Precision: 0.6706292273006764 Recall: 0.15949305386302706
For the above, you can see the model correctly sacrifices accuracy for greater precision. The confusion matrix below shows we identifed 3145 defaults, which is almost 2/3 of the defaults that actually occured in the test dataset.
array([[37878, 1607],
[17243, 3272]], dtype=int64)
Below is a plot of which features our balanced random forest model found useful:
As expected, scores play a major role in succesfully predicting loan defaults. While we can use the above to find what features the model is using to make predictions, we cannot use it to know how these features contributed to our final predictions. In order to meet the explainability goal of this project, I used Shapley values. Shapley values originally come from game theory, and their intent is to represent what each feature contributes to the final output. Below is an example.
The above explains an individual prediction given by our balanced random forest model. In this case, we correctly identified a loan application that defaulted. As the visual shows, the low financial test scores contributed the most to our default prediction, while the other features contributed very little. One drawback to calculating shapley values for our predictions is computation time. Each prediction takes several seconds to produce shapley values for. Additionally, these shapley values are estimates, and do not necessarily reflect the true contribution of each feature.
Like the balanced random forest, the EBM with weights was also able to sacrifice accuracy to identify defaulted loans. Here are some metrics calculated on our hold out dataset:
Accuracy score: 0.69975 Precision: 0.6699772398096421 Recall: 0.1647166547970292
The precision matches the balanced random forest model, and the confusion matrix below also reveals that the classifier is paying attention to the positive class. Our accuracy may drop, but since we are dealing with an imbalanced dataset we want to trade off accuracy for correctly labeling the positive class.
array([[38747, 1595],
[16420, 3238]], dtype=int64)
As mentioned previously, EBMs are designed to be interpretable, and an aim of this project is to understand what the model has captured. In the below plot, we can see its most important features.
The scores are the most important, followed by gender, education level, and the loan amount. Since the EBM is a GAM, we can also visualize the relationship between each dependent variable and the target. Below is the relationship with the third score. As we saw in a previous visual, higher scores are associated with less default risk.
The visual of education type's relationship with target is very valuable. Having completed higher education leads to a default risk much lower than only having completed secondary. We can also tell from the density section of the visual that the other 3 categories are less common, so we should be careful on what we claim about their relationship with default risk.
Another powerful feature of an EBM is the ability to see why an observation was assigned a label. These visuals are similar to the shapley plots we used to understand the balanced random forest model. However, these visuals were created in less than a second, and the values represent the true contribution of each feature. Below are 2 examples of EBM's label justification. The first is a true negative. The applicant has great a great value for the 3rd financial score, along with favorable employment history and purchase amount, leading to a prediction of no default.
In the 2nd example, though the financial scores looked bad and lead to a default prediction, the loan applicant did not end up defaulting.
I noticed that many of the relationships the EBM captured looked noisy. The example of the count of children can be seen below.
This relationship is more likely to be caused by a low amount of observations at higher values rather than a real trend. By default, EBMs have a minimal sample size of 2 for each leaf. I increased that number to 1000 to force the EBM to only create leaves for significant sample sizes. Below are the resulting accuracies.
Accuracy score: 0.69935 Precision: 0.6720463480240016 Recall: 0.16485635976043042
When we raised the minimum number of samples in every leaf parameter from 2 to 1000, we do not see a significant difference performance wise. I also found that the relationship between count of children and default became much less volatile (note the y axis).
My primary goal was to create a model that detects loan defaults and is interpretable. Both the EBM and Balanced Random Forest models have a precision around 66%, meaning we can identify the majority of loan applicants that end up defaulting. My explainabitility requirement is met by the EBM by design, and with shapley values for the random forest model. The visuals these methods can generate can provide the justification necessary for Home Credit to make informed loan decisions based on the model's prediction.
My secondary goal was to identify the features that best predict loan defaults. The financial test scores were the most related to loan defaults by a wide margin. Other important features include gender, education, and employment length. Another interesting finding was the performance of the EBM. It is often assumed that having an inherently interpretable model comes at the cost of accuracy. For example, linear models are very understandable due to their simple nature, but their simplicity comes at the cost of not being able to capture complex relationships. But our EBM model demonstrates that performance can go hand in hand with explainability. Both it and the random forest model had identical performance, showcasing its ability to capture the complex relationships that exist in the dataset.
Although our precision was high, our recall was only about 16% for both models. Even with significant data engineering and modeling work, I could not maintain increased precision without dropping accuracy significantly. I also found that the 3 features I created were unsatisfying in their contribution to each model's ability to identify loan defaults. They were not found in the top 10 of either model's importances. Only the historical default count appeared to have a real relationship with loan defaults. Another disappointment was the weighted random forest. It did not seem to be able to properly weight default observations, and I was unable to figure out why.
There are a couple different areas within the project that could benefit from future work. First, I had to exclude property data. It is possible understanding this data could assist us in predicting which loans will end up with a default. Another area of additional work is utilizing a metric that will help us balance precision, recall, and accuracy. The Kaggle competition used the Area Under Receiver Operating Characteristic Curve to score submissions. Given more time, I would use this metric as the value to maximize, which would have helped me tune the models and compare them with each other.